ABSTRACT

An evaluation of short-period (2- to 7-hr) ceiling and visibility terminal forecasting
techniques indicates that it is possible to prepare objective forecasts of
these critical aviation weather parameters which yield a statistically significant
improvement over those presently provided in routine operations.

Two types of forecasts were examined: probability forecasts and categorical
forecasts. Probability forecasts were evaluated by means of the Brier-Allen
P-score, which measures the sharpness and validity of probabilities.

Four objective probability forecast procedures and special subjective probability
forecasts were compared. The control technique was a climatological procedure
that specifies the climatological probability of the events conditional on the initial
condition, the season of the year, and the time of day. This is called climatological
expectancy of persistence (CEP). Relative to this technique, the best technique is
multiple-discriminant analysis (MDA), which achieves a percentage increase ranging
from 7.9 to 12.5 for ceiling and 3.6 to 5.6 for visibility.

Categorical  forecasts  were  judged  by  the  Bryan  score,  which  measures  the 
skill  in  forecasting  operationally  important  categories  of  ceiling  and  visibility. 

The  control  forecasting  procedure  was  designated  as  persistence,  which  is  a 
specification  that  the  weather  will  remain  unchanged.  Forecasts  prepared  by  six 
procedures,  including  presently  available  operational  aviation  weather  forecasts 
prepared subjectively, were evaluated. Probability forecasts were converted to
categorical forecasts by the use of a loss function that maximizes the Bryan score.

Relative  to  the  control  technique  (persistence),  the  MDA  procedure  yielded 
the  greatest  improvement  in  Bryan  score.  This  improvement  ranged  from  9.9  to 
14.7%  for  ceiling  and  13.3  to  19.2%  for  visibility.  Presently  available  subjective 
aviation  forecasts  yielded  an  increment  ranging  from  -3.2  to  14.7%  for  ceiling  and 
1.7 to 2.7% for visibility.

Special  evaluations  of  experimental  subjective  probability  forecasts  revealed 
that they were inferior to MDA forecasts in terms of the Brier-Allen P-score; but,
when  converted  to  categorical  forecasts,  they  yielded  the  best  Bryan  scores  of  any 
forecast  technique  tested,  including  MDA  and  the  subjectively  prepared  categorical 
forecasts  available  in  routine  operations. 
1.0 INTRODUCTION

The design of the Common Aviation Weather System (CAWS) of the Federal Aviation
Agency [2] specifies important requirements for weather-data processing techniques
to  provide  short-period  predictions  of  terminal  weather  conditions.  As  part  of  the 
weather-data processing development program (Contract FAA/BRD-363) in support of
CAWS,  The  Travelers  Research  Center  has  undertaken  an  extensive  program  to  test  and 
evaluate  certain  terminal  weather  forecasting  techniques  to  determine  their  suitability 
for  use  in  CAWS.  The  purpose  of  this  test  and  evaluation  program  was  to  compare  the 
accuracy  of  objective  techniques  with  one  another,  and  with  the  terminal  forecasts 
presently  available  on  a  routine  operational  basis  that  are  prepared  subjectively  by 
station  forecasters. 

The  design  of  the  Common  Aviation  Weather  System  calls  for  large  numbers  of 
terminal forecasts to be prepared with great speed to meet the system output requirements.
After a preliminary evaluation of available techniques that might be suitable for
this  purpose,  it  was  decided  to  focus  attention  at  this  time  on  the  engineering  of  objective 
statistical  procedures  that  are  readily  adaptable  to  machine  computation. 

This  decision  was  based  largely  on  two  considerations.  First,  it  was  felt  that  in 
the  light  of  our  knowledge  of  the  physical  processes  governing  small-scale  phenomena 
of importance in aviation terminal weather problems, techniques based upon the use of
the  physical  equations  would  require  an  extensive  and  indeterminate  developmental 
period,  and  would  be  best  considered  as  a  logical  follow-up  development  to  the  statistical 
procedures. Second, it was felt that techniques that could provide probabilistic predictions
would be of great value in increasing the utility of terminal weather information,
permitting  as  they  do  the  communication  of  the  degree  of  certainty  of  the  forecasts,  and 
providing  a  rational  basis  for  aviation  operational  and  planning  decisions.  This  report 
contains  a  description  of  the  extensive  test  program  and  a  summary  evaluation  of  the 
results. 


1.1  History  of  the  Test 


Originally,  the  test  and  evaluation  of  various  terminal  forecasting  techniques  was 
specified  as  a  requirement  of  the  weather  forecasting  technique  development  work  in 
support of Weather Observing and Forecasting System 433L, Contract AF 30(635)-14459.
Subsequently,  the  Federal  Aviation  Agency  assumed  responsibility  for  certain  phases  of 
the  technique  development  work,  including  this  test  and  evaluation  program.  A  Forecast 
Evaluation  Working  Group  was  formed  by  the  government  to  specify  the  manner  in  which 
the  evaluation  was  to  be  conducted.  Its  membership  included  representatives  of  the 
U.S.  Air  Force,  the  Federal  Aviation  Agency,  and  the  U.S.  Weather  Bureau,  as  well  as 
the contractors concerned, The United Aircraft Corporation and The Travelers Research
Center, Inc. The Working Group devised a plan specifying the conditions of the test.
Among these conditions were: the terminals at which forecasts were to be made; the
forecast lengths; the valid times of the forecasts; the elements to be forecast and the
class  limits  for  categorizing  these  elements;  the  year  for  which  forecasts  would  be 
evaluated;  and  the  method  for  scoring  the  forecasts.  This  plan  was  submitted  to  the 
Federal  Aviation  Agency  as  a  technical  memorandum  [5]  prepared  by  The  Travelers 
Research  Center,  Inc.  The  plan  was  accepted  by  the  FAA. 

1.2 Scope of the Test

The test was designed to answer two questions:

(a) How do objective statistical forecasts compare with operationally available
subjective forecasts?

(b) Which of the statistical techniques tested is most accurate?

After  a  survey  of  the  available  statistical  techniques,  the  data  sources,  and  the  time 
available  for  the  test,  a  selection  of  techniques  and  data  was  made  and  approved  by  the 
government. Five techniques were agreed upon, and the data to be used were restricted
to  surface  hourly  airways  observations. 

It  is  important  to  emphasize  that  this  test  only  establishes  an  important  and 
valuable  bench  mark  that  enables  the  assessment  of  how  well  we  can  presently  do  in 
terminal  forecasting.  Future  development  of  statistical  techniques  using  upper-air 
data,  derived  parameters,  and  additional  stations  and  data  may  well  produce  improved 
forecasts.  The  continuing  research  in  small-scale  dynamical  weather  prediction  and 
the  introduction  of  denser  networks  of  stations  may  also  contribute  significantly  to  our 
ability  to  improve  terminal  forecasting  capabilities. 


1.3 Organization of the Report


The  main  body  of  this  report  is  a  summary  description  of  the  test,  its  results,  and 
the  conclusions  to  be  drawn  from  it.  Appendix  A  is  a  nonmathematical  description  of 
the  Bryan  score.  A  supplement  to  the  report  contains  a  detailed  exposition  of  the  test; 
a  list  of  all  the  variables  presented  to  the  statistical  techniques  and  all  the  variables 
used  to  produce  the  statistical  forecasts;  and  all  the  contingency  tables  of  forecast  versus 
observed values and verification scores. These three sections, together with the
computer programs, which are available, contain sufficient information to permit other
investigators  to  duplicate  any  or  all  parts  of  the  evaluation. 


2.0 DESCRIPTION OF THE TEST PLAN


The  execution  of  the  test  plan  included  the  following  steps: 

(a)  the  specification  by  the  government  of  the  terminals,  elements,  forecast 
lengths, limits for categorization, statistical techniques, input and valid times,
subjective forecasts, and verification procedures,

(b)  the  designation  by  the  government  of  a  year,  henceforth  called  the  evaluation 
year,  on  whose  data  all  techniques  would  be  tested, 

(c) the collection and processing of dependent and independent (evaluation-year)
data,

(d)  the  development  of  the  statistical  forecast  techniques  on  the  dependent  data, 

(e)  the  production  of  statistical  forecasts  on  independent  data, 

(f)  the  collection  and  decoding  of  subjective  forecasts,  and 

(g)  the  verification  of  all  forecasts. 

2.1  Definition  of  Forecast  Length 

Figure  2-1  shows  how  forecast  length  is  defined. 


Input  time  is  the  time  of  observation  of  the  data  used  in  the  preparation  of  a 
terminal  forecast.  For  statistical  forecasts,  the  input  time  was  the  observation  time  of 
the  data  used  in  making  a  forecast.  For  subjective  forecasts,  such  as  FTs  and  TAFORs, 
the  input  time  is  uncertain  but  was  generally  assumed  to  be  the  whole  hour  preceding  the 
filing  time  of  the  forecast.  The  filing  time  is  the  time  by  which  a  forecast  must  be  given 
to communications personnel to ensure meeting communications schedules. For example,
a forecast with a filing time of 1130 was assumed to have an input time of 1100.

Fig. 2-1. Definition of input time, forecast length, initial time, valid time,
and valid period for terminal forecasts. (Diagram not reproduced: a time scale
marking input time, initial time, valid time, and final hour, with the forecast
length and the valid period indicated as intervals.)

Valid period is the specified period of time during which a forecast is valid.
For example, an FT1 forecast is delivered to communications personnel at 1130 and
designated as a 12-hr forecast starting at 1200 and ending at 2400. The valid period
for this forecast is 1200-2400.

Initial time is the beginning of the first hour of the valid period. In the example
cited above, the initial time is 1200.


Final  time  is  the  end  of  the  last  hour  of  the  valid  period.  In  the  example  cited 
above,  the  final  time  is  2400. 

Valid  time  is  that  instant  of  time  at  which  the  forecast  value  of  a  parameter  is 
expected  to  occur.  For  example,  if  a  500-ft  ceiling  is  forecast  to  occur  at  1400,  the  valid 
time  of  the  forecast  is  1400. 


Forecast  length  is  the  difference  between  valid  time  and  input  time. 
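
The definitions above can be illustrated with a short sketch. The function names are hypothetical and not part of the test plan; times are assumed to be given as four-digit clock values (e.g., 1130), and the treatment of a filing time that falls exactly on the hour is an assumption, since the text gives only the 1130 example.

```python
def input_time_from_filing(filing_hhmm):
    """Assumed input time for a subjective forecast: the whole hour
    preceding the filing time (1130 -> 1100).
    ASSUMPTION: a filing time exactly on the hour is returned unchanged."""
    return (filing_hhmm // 100) * 100


def forecast_length_hr(input_hhmm, valid_hhmm):
    """Forecast length = valid time - input time, in whole hours.
    The modulo handles valid times that fall after midnight."""
    return ((valid_hhmm // 100) - (input_hhmm // 100)) % 24


# A forecast filed at 1130 with a 500-ft ceiling forecast for 1400:
inp = input_time_from_filing(1130)
print(inp, forecast_length_hr(inp, 1400))
```

With an input time of 1700 and a valid time of 0000, the same function gives a 7-hr forecast length.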


2.2  Terminals,  Elements,  Forecast  Lengths,  and  Limits 

The  seven  terminals  approved  by  the  Working  Group  are  given  in  Table  2-1.  The 
forecast  elements  are  ceiling  and  visibility.  The  approved  forecast  lengths  for  each 
station  are  listed  in  the  table.  The  Group  decided  that,  for  the  purpose  of  verifying  the 
forecasts,  each  element  at  each  station  would  be  classified  into  five  categories.  The 
limits for categorization are also listed in Table 2-1.
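
As an illustration of the categorization step, the following minimal sketch assigns an observed ceiling to one of the five categories. The function name is hypothetical; the limits are the Idlewild ceiling limits as read from Table 2-1, and it is assumed that each limit is an exclusive upper bound (a 200-ft ceiling falls in category 2).

```python
from bisect import bisect_right

# Upper limits (ft) of ceiling categories 1-4 at Idlewild, as read from
# Table 2-1; category 5 has no upper limit.
IDL_CEILING_LIMITS_FT = [200, 500, 1000, 3000]


def category(value, upper_limits):
    """Return the category number (1-based) of value.
    Category k contains values below upper_limits[k-1] and at or above
    the previous limit; values at or above the last limit fall in the
    final, unlimited category."""
    return bisect_right(upper_limits, value) + 1


for ceiling_ft in (150, 200, 999, 3000, 12000):
    print(ceiling_ft, category(ceiling_ft, IDL_CEILING_LIMITS_FT))
```

The same function applies to visibility once the appropriate limits in miles are supplied.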

TABLE 2-1

TERMINALS, ELEMENTS, FORECAST LENGTHS, AND LIMITS
DESIGNATED BY THE WORKING GROUP FOR THE EVALUATION

        Forecast       Upper limit of ceiling categories, ft     Upper limit of visibility categories, mi
Sta*    lengths, hr    1      2       3       4       5          1             2     3             4     5
ACY     3,5,7          <200   <500    <1000   <3000   ≤unl       (illegible)   <1    (illegible)   <3    ≤unl
CEF     2,3,4,6        <200   <600    <1500   <5000   ≤unl       (illegible)   <1    <3            <5    ≤unl
DCA     2,3,5,7        <200   <500    <1000   <3000   ≤unl       (illegible)   <1    <2            <3    ≤unl
IDL     2,3,5,7        <200   <500    <1000   <3000   ≤unl       (illegible)   <1    <2            <3    ≤unl
OFF     2,4,6          <300   <1000   <1500   <5000   ≤unl       (illegible)   <1    <2            <5    ≤unl
RND     2,4,6          <200   <1400   <1500   <5000   ≤unl       (illegible)   <1    <3            <5    ≤unl
WRI     2,4,6          <200   <500    <1500   <5000   ≤unl       (illegible)   <1    <3            <5    ≤unl

*ACY  Atlantic City Airport             OFF  Offutt Air Force Base
 CEF  Westover Air Force Base           RND  Randolph Air Force Base
 DCA  Washington National Airport       WRI  McGuire Air Force Base
 IDL  Idlewild International Airport

In meteorology, a variable for which a forecast is required is often termed a
predictand. In this report, a predictand is defined as a specific element at a specific
station for a specific forecast length. For example, 2-hr ceiling height at Idlewild is one
predictand, 3-hr ceiling height at Idlewild is a second predictand, 3-hr visibility at Idlewild
is another predictand, etc. Table 2-1 lists 48 predictands, comprising two elements at
seven stations for either three or four forecast lengths.
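
The count of 48 predictands can be checked with a short enumeration. This is only a sketch: the forecast lengths per station are a reading of Table 2-1 (three stations with four lengths, four stations with three), and the variable names are hypothetical.

```python
# Forecast lengths (hr) per station, as read from Table 2-1.
# ASSUMPTION: CEF, DCA, and IDL each have four lengths, which is the
# reading consistent with the stated total of 48 predictands.
FORECAST_LENGTHS_HR = {
    "ACY": (3, 5, 7),
    "CEF": (2, 3, 4, 6),
    "DCA": (2, 3, 5, 7),
    "IDL": (2, 3, 5, 7),
    "OFF": (2, 4, 6),
    "RND": (2, 4, 6),
    "WRI": (2, 4, 6),
}
ELEMENTS = ("ceiling", "visibility")

# One predictand = one element at one station for one forecast length.
predictands = [
    (station, element, length)
    for station, lengths in FORECAST_LENGTHS_HR.items()
    for element in ELEMENTS
    for length in lengths
]
print(len(predictands))
```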

2.3  Collection  and  Preprocessing  of  Data 

The  data  consisted  of  a  dependent  set  covering  the  10  years  from  1949  through  1958 
and  an  independent  set  (evaluation-year  data)  extending  over  the  year  from  October  1,  1960, 
through  September  30,  1961.  Standard  hourly  airways  observations  as  punched  from 
WBAN-10A and -10B forms were obtained from Asheville on IBM-705 magnetic tapes.

There  were  approximately  96,000  surface  observations  for  each  of  53  stations,  where  an 
"observation" contains about 25 elements. Thus, some 125,000,000 pieces of data were
processed  in  carrying  out  the  test. 

2.3.1 Dependent Data

Data on the IBM-705 tapes were edited and transferred to IBM-7090 tapes. The data
were then processed through a computer program, which rearranged the data from observation
to vector format, wherein a single element at a single station forms one record on the
tape. The same program adjusted all data to Eastern Standard Time, categorized ceiling
heights and visibility, added constants to certain elements (such as temperature) to
ensure that every value was positive, and "packed" all data so that one machine location
contained more than one datum.

Because  of  restrictions  on  the  size  of  computer  storage  and  the  need  for  speed  of 
operation,  not  all  data  for  the  complete  24  hours  of  each  day  for  the  10  years  of  the 
dependent  sample  (a  total  of  87,000  hours)  were  used.  A  random  selection  of  from  6,000 
to  10,000  hours  was  made  and  used.  Gross  error  checks  were  made  to  ensure  that,  for 
any  hour  chosen,  all  elements  in  all  observations  at  every  station  used  were  correct. 

Finally, data at various stations were combined to form networks of stations, one network
for each of the seven stations. All this preprocessing resulted in individual computer
tapes  for  each  of  the  seven  networks. 

2.3.2 Independent Data

The  independent  data  were  processed  similarly.  Because  these  data  were  in  slightly 
different  form,  some  additional  computer  programs  were  written.  A  set  of  data  to  match 
the  form  of  the  dependent  data  was  developed. 


2.4  Development  of  Statistical  Forecasting  Techniques 


The Working Group selected the following statistical techniques to be tested:

(a) persistence,

(b) climatological expectancy of persistence (CEP),

(c) grouping,

(d) Lund contingency prognosis,

(e) multiple-discriminant analysis (MDA), and

(f) the Lewis technique.

Each technique, except persistence, requires the preparation of forecast tables, equations,
and/or constants from a dependent sample of data. For each test station, a surrounding
network of stations was selected and prepared.


TABLE 2-2

STATION NETWORKS USED IN THE DEVELOPMENT
OF TERMINAL FORECASTING TECHNIQUES
(each network is listed under its test station)

Atlantic City WBAS, N.J.
    Lakehurst NAS, N.J.
    McGuire AFB, N.J.
    Millville, N.J.
    Norfolk, Va.
    Olmstead AFB, Pa.
    Philadelphia, Pa.
    Salisbury, Md.
    Washington N.A., D.C.
    Wilkes Barre/Scranton, Pa.

Idlewild I.A., N.Y.
    Albany, N.Y.
    Binghamton, N.Y.
    Concord, N.H.
    Lakehurst NAS, N.J.
    Olmstead AFB, Pa.
    Providence, R.I.
    Salisbury, Md.
    Suffolk County AFB, N.Y.
    Teterboro, N.J.
    Windsor Locks, Conn.

McGuire AFB, N.J.
    Albany, N.Y.
    Allentown, Pa.
    Atlantic City WBAS, N.J.
    Lakehurst NAS, N.J.
    Newark, N.J.
    Norfolk, Va.
    Philadelphia, Pa.
    Providence, R.I.
    Washington N.A., D.C.
    Williamsport, Pa.

Offutt AFB, Neb.
    Des Moines, Iowa
    Grand Island, Neb.
    Huron, S.D.
    Kansas City, Mo.
    Minneapolis, Minn.
    Moline, Ill.
    North Platte, Neb.
    Schilling AFB, Neb.
    Sioux Falls, S.D.
    Springfield, Mo.

Randolph AFB, Tex.
    Bergstrom AFB, Tex.
    Brownsville, Tex.
    Connally AFB, Tex.
    Corpus Christi, Tex.
    Ellington AFB, Tex.
    Fort Worth, Tex.
    Lake Charles, La.
    Laredo AFB, Tex.
    Laughlin AFB, Tex.

Washington N.A., D.C.
    Annapolis NAF, Md.
    Atlantic City WBAS, N.J.
    Gordonsville, Va.
    Martinsburg, W. Va.
    Norfolk, Va.
    Patuxent River NAS, Md.
    Pittsburgh, Pa.
    Roanoke, Va.
    Williamsport, Pa.

Westover AFB, Mass.
    Albany, N.Y.
    Atlantic City WBAS, N.J.
    Burlington, Vt.
    Hanscom Field, Mass.
    Idlewild, N.Y.
    Portland, Me.
    Providence, R.I.
    Syracuse, N.Y.
    Wilkes Barre/Scranton, Pa.
    Windsor Locks, Conn.

2.4.1 Test Station Networks

The test station networks were selected by a committee of experienced forecasters.
The choice was limited to the data available on magnetic tape as received
from the National Weather Records Center, with the additional constraint that no network
could exceed 11 stations. As closely as the available data permitted, the network stations
were selected so that they formed two concentric circles about the predictand station.

The  inner  circle  varied  from  25  to  100  mi,  and  the  outer  varied  from  125  to  about  250  mi. 
The  station  networks  are  listed  in  Table  2-2. 

2.4.2  Control  Techniques 

It  was  important  that  control  techniques  be  established  against  which  the  performance 
of  other  techniques  would  be  judged.  Separate  control  techniques  for  categorical  and 
probability forecasts were designated. Persistence was designated as the control technique
for categorical forecasts and CEP as the control technique for probability forecasts.


2.4.2.1 Persistence Forecasts

Persistence forecasts are simple statements that the weather at the valid time will
be the same as the weather at the input time. No preparation of forecast tables or equations
is required.


2.4.2.2 Climatological-expectancy-of-persistence Forecasts


CEP  forecasts  are  persistence  forecasts  conditional  on  the  initial  conditions,  the 
season  of  the  year,  and  time  of  the  day.  Dependent  data  consisting  of  every  even  hour  in 
the 10 years were stratified into two seasons (May through October and November through
April) and then into two diurnal periods. This stratification yielded four sets of data.

For  each  set,  a  frequency-count  table  was  computed.  Table  2-3  is  an  example.  The  value 
26  in  the  first  row  represents  the  number  of  times  that  ceilings  below  200  ft  at  the  input 
time  were  followed  by  ceilings  below  200  ft  5  hr  later.  Other  entries  in  the  table  have 
similar  meaning. 
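
The stratification and counting described above can be sketched as follows. The function names and the record layout are hypothetical, and the boundary between the two diurnal periods is an assumption made for illustration (input hours 01 through 13 versus the remainder).

```python
from collections import defaultdict


def season(month):
    """Two seasons, as in the text: May-October and November-April."""
    return "May-Oct" if 5 <= month <= 10 else "Nov-Apr"


def diurnal_period(hour, first=1, last=13):
    """ASSUMPTION: illustrative split into input hours first..last
    and all other hours."""
    return "period A" if first <= hour <= last else "period B"


def frequency_counts(cases, n_cat=5):
    """cases: iterable of (month, input_hour, cat_at_input, cat_at_valid),
    with categories numbered 1..n_cat. Returns one n_cat x n_cat count
    table per (season, diurnal period) stratum; rows are the category at
    the input time, columns the category at the valid time."""
    tables = defaultdict(lambda: [[0] * n_cat for _ in range(n_cat)])
    for month, hour, c_in, c_out in cases:
        tables[(season(month), diurnal_period(hour))][c_in - 1][c_out - 1] += 1
    return dict(tables)


# Three made-up cases, for illustration only.
demo = [(12, 5, 1, 1), (12, 5, 1, 2), (6, 17, 5, 5)]
for stratum, table in frequency_counts(demo).items():
    print(stratum, table[0], table[4])
```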
TABLE 2-3

FREQUENCY COUNT OF IDLEWILD CEILINGS*

Ceiling class
interval, ft         i       1      2      3      4      5    Total
C < 200              1      26     34     19     16     53      148
200 ≤ C < 500        2      40    120     70     66     93      389
500 ≤ C < 1000       3      23     97    267    175    167      729
1000 ≤ C < 3000      4       6     51    159    384    467     1067
3000 ≤ C             5      22     61    144    421   7886     8534

*5-hr forecast. November-April. Input hours 01E to 13E.


TABLE 2-4

CLIMATOLOGICAL-EXPECTANCY-OF-PERSISTENCE FORECAST TABLE†

Idlewild ceiling
class interval, ft   i       1       2       3       4       5
C < 200              1   0.176   0.230   0.128   0.108   0.358
200 ≤ C < 500        2   0.103   0.308   0.180   0.170   0.239
500 ≤ C < 1000       3   0.032   0.133   0.366   0.240   0.229
1000 ≤ C < 3000      4   0.006   0.048   0.149   0.360   0.438
3000 ≤ C             5   0.003   0.007   0.017   0.049   0.924

†5-hr forecast. November-April. Input hours 01E to 13E.


A forecast table, Table 2-4, was obtained by dividing each entry of Table 2-3 by its
row total. Thus, Table 2-4 contains estimates of the conditional probability of occurrence
of each category of ceiling height 5 hr later when the category at the input time, the season
of the year, and the input hour are given. Four tables were computed for each of the
48 predictands. These tables have been published [1]. A forecast is made by entering
the appropriate table with the category observed at the input time and reading off
the five estimated probabilities.
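
The row normalization that turns Table 2-3 into Table 2-4 can be sketched as follows. The counts are those of Table 2-3, with each row summing to its printed total; the function names are hypothetical.

```python
# Frequency counts from Table 2-3 (rows: category at input time,
# columns: category 5 hr later). Each row sums to its printed total.
COUNTS = [
    [26, 34, 19, 16, 53],       # C < 200          (total 148)
    [40, 120, 70, 66, 93],      # 200 <= C < 500   (total 389)
    [23, 97, 267, 175, 167],    # 500 <= C < 1000  (total 729)
    [6, 51, 159, 384, 467],     # 1000 <= C < 3000 (total 1067)
    [22, 61, 144, 421, 7886],   # 3000 <= C        (total 8534)
]


def cep_table(counts):
    """Divide each entry by its row total to estimate the conditional
    probability of each later category given the input category."""
    return [[n / sum(row) for n in row] for row in counts]


def cep_forecast(table, input_category):
    """Enter the table with the category observed at the input time
    (1-based) and read off the five probabilities."""
    return table[input_category - 1]


table = cep_table(COUNTS)
print([round(p, 3) for p in cep_forecast(table, 1)])
```

Rounded to three decimals, each row reproduces the corresponding row of Table 2-4.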
2.4.3 Statistical Techniques

Four statistical techniques were selected for the test. A description of these techniques
is given in a technical publication [3]. The grouping, Lund, and multiple-discriminant-analysis
techniques produce probability forecasts as their basic output. The
Lewis technique produces a categorical forecast. In the case of the Lewis technique,
forecasts for only four predictands were produced for the evaluation year by the Electronic
Computer Branch, 1210th Weather Squadron, Air Weather Service, and were
delivered to The Travelers Research Center for evaluation. Forecasts for all other techniques
were prepared by The Travelers Research Center. Because of the limited number
of predictands treated by the Lewis technique, their evaluation is described separately
(see Section 3.3.3).

2.4.4 Subjective Forecasts


2.4.4.1 Categorical Forecasts

The Forecast Evaluation Working Group chose a cross section of subjective forecasts
currently in use by operating agencies. Terminal forecasts prepared under routine
operational  conditions  during  the  evaluation  year  (October  1,  1960,  through  September  30, 
1961)  were  collected  and  decoded  at  The  Travelers  Research  Center  for  all  stations 
except  Atlantic  City.  The  Atlantic  City  forecasts  were  furnished  in  decoded  form  by  the 
U.S.  Weather  Bureau.  The  types  of  forecasts  used  at  each  station  are  given  in  Table  2-5. 


TABLE 2-5

TYPES OF SUBJECTIVE FORECAST
USED IN EVALUATION

Sta     Type of forecast
ACY     FT1
CEF     TAFOR
DCA     FT2, SPF*
IDL     FT1, SPF*
OFF     TAFOR
RND     PLATFS
WRI     TAFOR, SAGE

*Special probability forecast.
FT1s, FT2s, and TAFORs are the usual aviation terminal forecasts prepared by
civilian and military personnel. SAGE forecasts are special 2-hr terminal forecasts
prepared, at present, 12 times daily by military forecasters for the Air Defense Command.
PLATFSs are aviation forecasts prepared at Kansas City, Missouri, for 27 Midwestern
Air Force bases.

Subjective categorical forecasts were decoded for only 42 of the 48 predictands.

2.4.4.2 Probability Forecasts

Weather Bureau forecasters at Idlewild and Washington National Airports prepared
3-, 5-, and 7-hr probability forecasts on an experimental basis as part of this test. The
forecasters were instructed to use data up to and including the input time and to prepare
forecasts not later than 1 hr after this time.


2.4.4.3 Decoding Procedures

Most forecasts were received in the format used in the communication networks and
had to be completely decoded. The general principle applied in decoding was to arrive at
what the forecaster had in mind at the instants of time selected as the valid times. The
development and changes in weather reflected in the forecasts were assumed to be linear.
A  complete  description  of  the  decoding  procedure  may  be  found  in  the  supplement  to  this 
report. 


2.4.5  Input  and  Valid  Times 

The input times and valid times to be used when preparing forecasts on the independent
data were selected by the Forecast Evaluation Working Group and are given in
Table  2-6.  SAGE  forecasts  are  issued  at  the  same  time  as  the  observation  from  which 
the  forecast  is  prepared. 

Forecasts  prepared  by  all  statistical  techniques  used  the  same  input  and  valid 
times as the subjective forecasts with which they were compared.

2.4.6  Method  of  Verifying  Forecasts 

The  Forecast  Evaluation  Working  Group  specified  the  method  for  verifying  the 
forecasts.  The  plan  [5]  approved  by  the  FAA  stated: 

Forecasts prepared for the PTET [Plan for Test and Evaluation
of Terminal Forecasting Techniques] will be expressed in categorical
and/or probabilistic terms. For the categorical forecasts, a score devised
by Bryan (1961) will be obtained from each contingency table of forecast vs.
observed values and will be used to compare forecast techniques. Various
other scores such as percent correct, skill scores of various kinds,
prefigurance and post-agreement will also be computed. For the probabilistic
forecasts, the P-score, described by Brier and Allen (1951), will be calculated
and compared using the t-test for the means of paired observations.


TABLE 2-6

INPUT AND VALID TIMES FOR SUBJECTIVE FORECASTS

        Forecast         Forecast       Input time,    Valid time,
Sta     type             length, hr     EST            EST
ACY     FT1              3, 5, 7        05             08, 10, 12
                                        17             20, 22, 00
CEF     TAFOR            2, 4, 6        05             07, 09, 11
                                        17             19, 21, 23
DCA     FT2 and SPF*     3, 5, 7        05             08, 10, 12
                                        17             20, 22, 00
IDL     FT2 and SPF*     3, 5, 7        05             08, 10, 12
                                        11             14, 16, 18
                                        17             20, 22, 00
                                        23             02, 04, 06
OFF     TAFOR            2, 4, 6        05             07, 09, 11
                                        17             19, 21, 23
RND     PLATFS           2, 4, 6        07             09, 11, 13
                                        19             21, 23, 01
WRI     SAGE             2              05             07
                                        17             19
        TAFOR            2, 4, 6        05             07, 09, 11
                                        17             19, 21, 23

*Special probability forecast.

This plan was followed. Brier-Allen P-scores and Bryan scores for each station, and
averages  over  the  seven  stations,  are  presented  in  this  report. 

A  variety  of  scores  for  categorical  forecasts  other  than  that  devised  by  Bryan  was 
considered  by  the  Working  Group.  The  Bryan  score  was  designated  as  the  primary  scoring 
procedure  for  categorical  forecasts.  The  Working  Group  requested  that  other  scores  be 
computed  in  the  process  of  evaluation.  For  the  purpose  of  measuring  these  scores,  all 
probability  forecasts  will  be  converted  to  categorical  forecasts  by  applying  a  loss  function 
designed  to  maximize  the  scores.  These  scores  will  be  presented  in  the  supplement. 

2.4.6.1 Brier-Allen P-score

For a single probability forecast, the P-score is defined as

$$P = \sum_{i=1}^{5} (f_i - E_i)^2, \qquad (2\text{-}1)$$

where $E_i$ takes the value 1 or 0 according to whether the predictand occurs in class i or
not, and $f_1$, $f_2$, $f_3$, $f_4$, and $f_5$ represent the forecast probabilities. A P-score of 0
indicates a perfect forecast; the poorest score is 2, which occurs when a probability of 1 is
assigned to a class other than the one that occurs. In comparing two forecasts of the same
observed quantity, the lower P-score indicates the better forecast. The P-score is widely used
and accepted for verifying probability forecasts.
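
A minimal sketch of the P-score computation for a single five-class forecast follows. The function name is hypothetical, and the example probabilities are the first row of Table 2-4.

```python
def p_score(probs, observed_class):
    """Brier-Allen P-score for one forecast: the sum over classes of
    (f_i - E_i)^2, where E_i is 1 for the observed class and 0 otherwise.
    Classes are numbered from 1."""
    if not 1 <= observed_class <= len(probs):
        raise ValueError("observed_class must index one of the forecast classes")
    return sum(
        (f - (1.0 if i == observed_class else 0.0)) ** 2
        for i, f in enumerate(probs, start=1)
    )


cep_row = [0.176, 0.230, 0.128, 0.108, 0.358]
print(p_score(cep_row, 5))              # forecast verified by class 5
print(p_score([0, 0, 0, 0, 1], 5))      # perfect forecast
print(p_score([1, 0, 0, 0, 0], 5))      # poorest possible forecast
```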


2.4.6.2 Paired-comparison t-test

Extreme care was taken to ensure that exactly the same forecasts were made in any
comparison of forecast techniques. That is, if a 5-hr subjective ceiling forecast was made
at 05 Eastern Standard Time (EST) October 12, 1960, then statistical forecasts were made
also at this time. There were no exceptions. This matching of forecasts must be done to
ensure  a  valid  comparison  of  forecast  techniques.  It  also  permits  use  of  the  t-test  for 
paired  comparisons.  This  tests  whether  the  mean  P-score  for  one  forecast  technique  is 
significantly  better  than  the  mean  P-score  of  another  technique. 

The test value is

$$t = \bar{d} \left[ \frac{\sum_{i=1}^{N} (d_i - \bar{d})^2}{N(N-1)} \right]^{-1/2}, \qquad (2\text{-}2)$$

where $\bar{d}$ is the difference between the mean P-scores, $d_i$ is the difference between
individual P-scores, and N is the number of forecasts made. The P-scores of two forecast
techniques tend to be highly correlated because poor scores tend to be made by both techniques
on difficult forecast situations and good forecasts tend to be made on easy situations.

The  power  of  this  widely  used  test  lies  in  eliminating  the  correlation  by  taking  differences, 
and  this  can  be  done  only  when  the  forecasts  are  paired. 
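
The paired-comparison statistic can be sketched as follows, assuming two equal-length lists of P-scores for the same matched forecasts. The function name and the sample scores are hypothetical.

```python
from math import sqrt


def paired_t(scores_a, scores_b):
    """t statistic for paired P-scores: the mean difference divided by
    the standard error of the differences, sqrt(sum((d_i - d_bar)^2) / (N(N-1)))."""
    if len(scores_a) != len(scores_b):
        raise ValueError("forecasts must be matched one-to-one")
    n = len(scores_a)
    if n < 2:
        raise ValueError("at least two paired forecasts are needed")
    d = [a - b for a, b in zip(scores_a, scores_b)]
    d_bar = sum(d) / n
    ss = sum((x - d_bar) ** 2 for x in d)
    # Note: ss is zero when all differences are identical, in which
    # case the statistic is undefined and a ZeroDivisionError is raised.
    return d_bar / sqrt(ss / (n * (n - 1)))


# Made-up P-scores for two techniques on the same four forecasts.
technique_a = [0.5, 0.4, 0.9, 0.2]
technique_b = [0.4, 0.2, 0.6, 0.2]
print(paired_t(technique_a, technique_b))
```

A positive value here means technique A has the higher (poorer) mean P-score; the statistic is referred to the t distribution with N - 1 degrees of freedom.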

2.4.6.3 Bryan Score

For a single categorical forecast, the Bryan score is

$$B = W_{ij}, \qquad (2\text{-}3)$$

where $W_{ij}$ is the merit or demerit ascribed to a forecast of category i when category j
occurs. The W-values are designed to distinguish between levels of skill in forecasting
the occurrence or nonoccurrence of category 5 and, also, in forecasting which non-5
category will occur. In developing the score, it was taken for granted that forecasting
methods would be compared by the t-test. Because the Bryan score in its corrected form
has not been published previously, a nonmathematical description is given in Appendix A
and a mathematical description will be given in a separate publication.*


2.5  Generation of Categorical Forecasts from Probability Forecasts


The  statistical  forecast  techniques,  except  for  the  Lewis  technique,  produce  only 
probability  forecasts.  To  compare  these  forecasts  with  persistence,  Lewis  technique, 
and  subjective  forecasts  (which  are  in  categorical  form  only),  it  is  necessary  to  use  the 
probability  forecasts  to  generate  categorical  forecasts.  If  the  forecasts  were  perfect, 
there  would  be  no  question  of  the  method  of  generation.  The  correct  category  would  be 
given  by  a  probability  of  1  every  time,  and  that  would  be  the  categorical  forecast.  In  the 
presence  of  uncertainty  in  the  forecasts,  the  decision  is  not  this  simple. 


The  Bryan  score  specifies  a  loss  function  and  provides  a  means  for  arriving  at 
forecasts  on  the  basis  of  maximizing  gain  or,  in  this  case,  maximizing  the  score.  One  of 
the important features of probability forecasts is that they lend themselves to such
treatment. Therefore, the categorical forecasts for all statistical techniques are generated in
such  a  fashion  as  to  maximize  the  Bryan  score.  When  other  scores  are  computed,  they 
also  will  be  maximized. 

This is done as follows. Let $f_1$, $f_2$, $f_3$, $f_4$, and $f_5$ be the forecast probabilities. Five
quantities are computed:

$$ G_i = \sum_{j=1}^{5} W_{ij} f_j, \qquad i = 1, 2, \ldots, 5 \qquad (2\text{-}4) $$


The maximum G gives the categorical forecast. Thus, if $G_1$ is largest, category 1 is
forecast; etc. The W-values are the same as those in Eq. (2-3).
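The selection rule of Eq. (2-4) can be sketched as follows. The weight matrix used here is purely hypothetical (the actual Bryan W-values are not reproduced in this section), and the function name `categorical_forecast` is chosen only for the illustration.

```python
def categorical_forecast(weights, probs):
    """Return (category, gains), where gains[i] = sum_j W[i][j] * f[j] as in Eq. (2-4).

    weights[i][j] is the merit or demerit for forecasting category i+1
    when category j+1 occurs; probs[j] is the forecast probability f_{j+1}.
    """
    gains = [sum(w * f for w, f in zip(row, probs)) for row in weights]
    best = max(range(len(gains)), key=gains.__getitem__)
    return best + 1, gains


# Hypothetical weights: larger merits for the rarer low categories,
# and demerits that grow with the distance between forecast and observed class.
merits = [5.0, 4.0, 3.0, 2.0, 1.0]
W = [[merits[i] if i == j else -0.25 * abs(i - j) for j in range(5)]
     for i in range(5)]

f = [0.05, 0.10, 0.25, 0.20, 0.40]  # hypothetical forecast probabilities
category, gains = categorical_forecast(W, f)
print(category, [round(g, 4) for g in gains])
```

With these illustrative weights the chosen category is 3 even though category 5 has the highest probability, which is the kind of difference between score-maximizing and most-probable forecasts that the loss function is meant to produce.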


*Bryan,  J.  G.,  Scoring  System  for  Categorical  Forecasts  of  Ceiling  and  Visibility. 
TRC  Tech.  Rpt.  7044-59  (FAA  Tech.  Publication  26)  (to  be  published). 
2.6  Summary of the Test Plan

The test was designed to compare the forecasting accuracy of the statistical techniques
with one another and with subjective forecasts prepared under current operating
conditions. A secondary objective was to evaluate the accuracy of subjective probability
forecasts, which are not now prepared in routine operation. The control technique for
categorical forecasts was persistence and, for probability forecasts, climatological
expectancy of persistence.

A dependent sample of 10 years of data was used to develop the statistical techniques.
Categorical and probability forecasts were produced with these techniques on
one year of independent data. Subjective categorical forecasts for the same year were
collected and decoded. Special subjective probability forecasts were made. The
forecasting skills of all probability forecasts were compared by means of the P-score;
categorical forecasts were compared by means of the Bryan score.

Three statistical techniques produced probability forecasts: grouping, Lund, and
multiple-discriminant analysis. These were compared on 48 predictands: ceiling and
visibility forecasts at Idlewild and Washington National for 2, 3, 5, and 7 hr; at Westover
AFB for 2, 3, 4, and 6 hr; at Atlantic City for 3, 5, and 7 hr; and at McGuire, Offutt, and
Randolph AFB for 2, 4, and 6 hr.
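As a quick check on the count of 48, the stations and forecast lengths listed above can be enumerated. The sketch below is illustrative only; the dictionary name `lead_times_hr` and the station labels are chosen here for convenience.

```python
# Forecast lengths (hr) per station, as listed in the paragraph above.
lead_times_hr = {
    "Idlewild": [2, 3, 5, 7],
    "Washington National": [2, 3, 5, 7],
    "Westover AFB": [2, 3, 4, 6],
    "Atlantic City": [3, 5, 7],
    "McGuire AFB": [2, 4, 6],
    "Offutt AFB": [2, 4, 6],
    "Randolph AFB": [2, 4, 6],
}
ELEMENTS = ("ceiling", "visibility")

# One predictand per (station, element, forecast length) combination.
predictands = [
    (station, element, hours)
    for station, hours_list in lead_times_hr.items()
    for element in ELEMENTS
    for hours in hours_list
]
print(len(predictands))
```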

Special  subjective  probability  forecasts  were  prepared  for  ceiling  and  visibility 
at  Idlewild  and  Washington  National  for  3,  5,  and  7  hr.  These  were  compared  with 
forecasts  made  with  the  three  statistical  techniques. 

Subjective categorical forecasts were collected for 42 of the 48 predictands. The
predictands omitted were ceiling and visibility at Idlewild and Washington National for
2 hr and at Westover for 3 hr. Categorical forecasts for the 42 predictands were
generated from the probability forecasts of the three statistical techniques. The categorical
forecasts for the six techniques were compared. A separate evaluation of the Lewis
technique was necessitated by the small sample of forecasts available.
3.0  SUMMARY OF TEST RESULTS


3.1  Probability Forecasts by Statistical Techniques

Probability forecasts of all 48 predictands were made by three statistical techniques
and the control technique (climatological expectancy of persistence, CEP) and were verified
by the P-score. These P-scores are given in Table 3-1. Examination of Table 3-1 shows
that the multiple-discriminant-analysis (MDA) technique yielded the best score on 44 of
the predictands and CEP, the control technique, yielded the best score on the other 4. The
paired-comparison t-test probabilities were combined by Fisher's method [4] to obtain
a single significance test for all 48 predictands. The MDA P-scores were significantly
better than those of CEP beyond the 1% level of significance.
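The report does not spell out how the t-test probabilities were combined. Assuming that "Fisher's method" refers to the usual combination in which $-2 \sum \ln p_k$ is referred to a chi-square distribution with 2k degrees of freedom, a minimal sketch is given below. The function name and the three p-values are hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values, assuming the standard Fisher statistic.

    Returns (statistic, combined_p), where statistic = -2 * sum(ln p) and
    combined_p is the upper-tail chi-square probability with 2k degrees of
    freedom. For even degrees of freedom that tail has the closed form
    exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(p_values)
    if k == 0 or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total


# Hypothetical p-values from three separate paired-comparison t-tests.
statistic, combined_p = fisher_combined_p([0.04, 0.20, 0.01])
print(round(statistic, 3), round(combined_p, 5))
```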

An index (I) of the amount of increase or decrease in forecasting skill of statistical
techniques relative to the control technique is

$$ I = \frac{P_{CEP} - P_S}{P_{CEP}} \times 100, \qquad (3\text{-}1) $$

where $P_{CEP}$ and $P_S$ denote the CEP and statistical-technique P-scores, respectively.

Because 0 is a perfect P-score, $P_{CEP} - 0$ is the total amount of forecasting skill not
accounted for by CEP. Therefore, I is the ratio of the improvement or deterioration in
forecasting skill of a statistical technique relative to the control in terms of forecasting
skill remaining to be accounted for. The factor 100 puts I into percentage form.
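As a worked example of Eq. (3-1), the sketch below applies the index to the mean 2- to 3-hr ceiling P-scores of Table 3-1(a) (CEP 0.2007, grouping 0.2035, MDA 0.1848). The function name `improvement_index` is chosen here for illustration.

```python
def improvement_index(p_control, p_technique):
    """Percent improvement I of Eq. (3-1); positive means better than the control."""
    if p_control <= 0.0:
        raise ValueError("control P-score must be positive")
    return (p_control - p_technique) / p_control * 100.0


p_cep, p_group, p_mda = 0.2007, 0.2035, 0.1848  # mean 2-3 hr ceiling P-scores
print(round(improvement_index(p_cep, p_mda), 1))    # MDA relative to CEP
print(round(improvement_index(p_cep, p_group), 1))  # grouping relative to CEP
```

Rounded to one decimal, these reproduce the 7.9 and -1.4 entries of Table 3-2 for 2- to 3-hr ceiling.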

Values of I were computed for the "average-over-station" P-scores of Table 3-1
and are presented in Table 3-2. The results displayed in Tables 3-1 and 3-2 and the
associated t-tests indicate that MDA probability forecasts of 2- to 7-hr ceiling and
visibility are significantly better than probability forecasts made either by the other two
statistical techniques or by the control technique, as measured by the Brier-Allen P-score.

3.2  Bryan Scores of Categorical Forecasts by All Techniques


The  probability  forecasts  of  the  four  statistical  techniques  (including  the  control 
technique)  were  used  to  generate  categorical  forecasts  for  42  of  the  48  predictands  in 
such  a  way  as  to  maximize  the  Bryan  score  (see  Section  2.5).  Categorical  persistence  and 
subjective forecasts for the same 42 predictands were also available. The forecasting
skill of each technique on every predictand was measured by the Bryan score, and the
results are given in Table 3-3. Paired-comparison t-tests were made between the scores
attained by each technique and every other technique.

Table  3-3  indicates  that  the  Bryan  scores  for  MDA  were  highest  on  29  of  the  42 
predictands,  those  for  subjective  forecasts  were  highest  on  6  predictands,  and  those  for 
the  other  techniques  were  highest  on  the  remaining  7  predictands.  The  average  Bryan 
scores  for  all  predictands,  given  in  Table  3-3(c),  indicate  that  MDA  scores  were  highest 
and  CEP  and  subjective  scores  were  approximately  equal  and  ranked  second.  The  paired- 
TABLE 3-1

P-SCORE* TEST RESULTS FOR THREE STATISTICAL TECHNIQUES
FOR EVALUATION-YEAR DATA

(a) The predictand element is ceiling

     Predictand          No. of                     P-score
Sta    Fcst length, hr   fcsts          CEP         Group       Lund        MDA

ACY         3              682          0.1783      0.2148      0.3164      0.1755
CEF         2              651          0.2178      0.2153      0.2897      0.1947
CEF         3              661          0.2681      0.2572      0.2577      0.2445
DCA         2          [illegible]      0.1222      0.1255      0.2199      0.1172
DCA         3          [illegible]      0.1349      0.1372      0.2417      0.1210
IDL         2            1,454          0.1816      0.1885      0.2082      0.1732
IDL         3            1,290          0.1825      0.1849      0.3446      0.1701
OFF         2              697          0.1700      0.1769      0.2588      0.1602
RND         2              695          0.2937      0.2949      0.3484      0.2707
WRI         2              726          0.2583      0.2401      0.2628      0.2208
Mean       2-3           8,054          0.2007      0.2035    [illegible]   0.1848

ACY         5              652          0.2106      0.2098    [illegible]   0.1969
CEF         4              657          0.2956      0.2897      0.3471      0.2465
DCA         5              546          0.1368      0.1360      0.2290      0.1213
IDL         5            1,457          0.2452      0.2385      0.2838      0.2171
OFF         4              656          0.2514      0.2541      0.2854      0.1945
RND         4              681          0.3508      0.3808      0.4261      0.3246
WRI         4              610          0.2907      0.2781      0.3190      0.2396
Mean       4-5           5,259          0.2513    [illegible]   0.3212      0.2201

ACY         7              676          0.2614    [illegible]   0.3681      0.2284
CEF         6              672          0.3335      0.3405      0.3715      0.2854
DCA         7              682          0.1681      0.1616      0.2?26      0.1408
IDL         7            1,459          0.2788      0.2649      0.2857      0.2578
OFF         6              668          0.2448      0.2627    [illegible]   0.2017
RND         6              679          0.3872      0.4362      0.4949      0.3559
WRI         6              608          0.3271      0.3501      0.4299      0.2810
Mean       6-7           5,444          0.2858      0.2933      0.3493      0.2501

*Lower score indicates better forecast.
(b) The predictand element is visibility

     Predictand          No. of                     P-score
Sta    Fcst length, hr   fcsts          CEP         Group       Lund        MDA

ACY         3              700          0.1895      0.2095      0.2477      0.1937
CEF         2              685          0.2443      0.2093      0.3852      0.1768
CEF         3              679        [illegible]   0.2308      0.2620      0.2254
DCA         2              687          0.1519      0.1530      0.1514      0.1394
DCA         3              611          0.1237      0.1282      0.1296      0.1208
IDL         2            1,366          0.1424      0.1488      0.1802      0.1553
IDL         3            1,343          0.1530      0.1616      0.1516      0.1425
OFF         2              698          0.1107      0.1102      0.1332      0.1099
RND         2              703          0.1194      0.1259      0.2255      0.1116
WRI         2              728          0.2619      0.2593      0.2722      0.2361
Mean       2-3           8,200          0.1707      0.1737      0.2139      0.1612

ACY         5              66?          0.1468      0.1530      0.1605      0.1506
CEF         4              676          0.2454      0.2580      0.3750      0.2306
DCA         5              699          0.1116      0.1197      0.1276      0.1101
IDL         5            1,31?          0.1580      0.1639      0.1961      0.1515
OFF         4              698          0.1253      0.134?      0.1298      0.1212
RND         4              685          0.0807      0.0991      0.2521      0.0779
WRI         4              607          0.304?      0.3086      0.3054      0.2707
Mean       4-5           5,344        [illegible]   0.1767      0.2209      0.1589

ACY         7              657          0.1474      0.1529      0.2558      0.1366
CEF         6              642          0.2682      0.2837      0.4550      0.2671
DCA         7              714          0.0970      0.1047      0.1048      0.0933
IDL         7            1,345          0.1875      0.1962      0.1893      0.1788
OFF         6              622          0.1124      0.1167      0.1789      0.1099
RND         6              683          0.0907      0.0973      0.2814      0.0843
WRI         6              608          0.2763      0.3070      0.3171      0.2678
Mean       6-7           5,271          0.1685      0.1798      0.2546      0.1625

(c) Composite of (a) and (b) for all stations and forecasts

No. of                  Mean P-score
fcsts          CEP         Group       Lund        MDA

37,572         0.2047      0.2106      0.2689      0.1875
TABLE 3-2

PERCENT IMPROVEMENT (I) OF P-SCORES RELATIVE TO CEP
FOR EVALUATION-YEAR DATA

     Predictand          No. of              I, %
Elem   Fcst length, hr   fcsts      Group     Lund      MDA

CIG        2-3            8054       -1.4     -36.9      7.9
CIG        4-5            5259       -1.6     -27.8     12.4
CIG        6-7            5444       -2.6     -22.2     12.5
VIS        2-3            8200       -1.8     -25.3      5.6
VIS        4-5            5344       -5.5     -31.9      5.1
VIS        6-7            5271       -6.7     -51.1      3.6


comparison t-test showed that the MDA scores were statistically significantly higher
than CEP and subjective scores. At the 5% level, MDA scores were higher than CEP
scores on 18 of the 42 predictands and higher than the subjective forecast scores on 13
predictands. Fisher's test [4] indicated that, for all forecasts combined, the scores
achieved by the MDA technique were statistically significantly better than those of either
CEP or subjective techniques, beyond the 1% level.

The average Bryan score for a very large number of perfect forecasts would be
1.0. However, for any limited number of forecasts, the maximum attainable score must
be computed from the observed frequencies of the various categories. The index I defined
in Eq. (3-1) now becomes

$$ I = \frac{B_S - B_C}{B_M - B_C} \times 100, \qquad (3\text{-}2) $$

where $B_S$ is the average Bryan score achieved by a forecast technique, $B_C$ is the average
Bryan score achieved by the control technique, and $B_M$ is the maximum average Bryan
score attainable within the sample of forecasts. Values of I were computed for the
"average-over-station" Bryan scores taken from Table 3-3 and are presented in Table 3-4.
MDA  yielded  the  highest  value  of  the  index  I. 
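The index of Eq. (3-2) can be sketched in the same way as Eq. (3-1). The composite mean Bryan scores for persistence (0.284) and MDA (0.389) are taken from Table 3-3(c); the maximum attainable score is not listed in this section, so the value `b_max = 1.0` below is only an assumption (the limiting value for a very large number of perfect forecasts), and the result is therefore not a figure from the report.

```python
def bryan_improvement_index(b_technique, b_control, b_max):
    """Percent improvement I of Eq. (3-2) relative to the control technique."""
    remaining = b_max - b_control  # skill not accounted for by the control
    if remaining <= 0.0:
        raise ValueError("b_max must exceed the control score")
    return (b_technique - b_control) / remaining * 100.0


b_pers, b_mda = 0.284, 0.389  # composite means from Table 3-3(c)
b_max = 1.0                   # ASSUMPTION: sample maximum is not given here
index_mda = bryan_improvement_index(b_mda, b_pers, b_max)
print(round(index_mda, 1))
```

A technique that matches the control gives 0, and one that attains the sample maximum gives 100, which mirrors the behavior of Eq. (3-1) for P-scores.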
TABLE 3-3

BRYAN-SCORE* TEST RESULTS FOR CATEGORICAL FORECASTS
FOR EVALUATION-YEAR DATA

*Higher score indicates better forecast.

(b) Predictand element is visibility

     Predictand          No. of                              Bryan score
Sta    Fcst length, hr   fcsts        Pers        Subj        CEP         Group       Lund        MDA

ACY         3              699        0.242       0.254       0.208       0.288       0.285       0.281
CEF         2              682        0.404       0.489       0.357       0.539       0.067       0.644
DCA         3              609        0.186       0.216       0.495       0.288       0.183       0.498
IDL         3            1,336        0.298       0.426       0.393       0.376       0.364       0.535
OFF         2              694        0.356       0.366       0.284       0.294       0.319       0.382
RND         2              702        0.331       0.310       0.296       0.423       0.492       0.547
WRI         2              612        0.529       0.513       0.543       0.512       0.460     [illegible]
Mean†      2-3           5,334        0.358       0.368     [illegible]   0.389       0.310       0.496

ACY†        5              663        0.094       0.221       0.095     [illegible] [illegible]   0.108
CEF         4              672        0.324       0.331       0.304       0.343       0.386       0.430
DCA†        5              696        0.185       0.087       0.376     [illegible]   0.305       0.472
IDL         5            1,311        0.180       0.248       0.338       0.270       0.186       0.359
OFF†        4              693        0.295     [illegible]   0.277       0.332       0.396       0.364
RND†        4              682        0.148     [illegible]   0.217       0.220       0.154       0.310
WRI         4              514        0.283       0.299       0.323       0.290       0.377       0.410
Mean       4-5           5,231        0.216     [illegible]   0.276       0.249       0.?80       0.359

ACY         7              657      [illegible] [illegible]   0.064       0.033       0.051       0.052
CEF         6              639      [illegible] [illegible]   0.246       0.372       0.298       0.307
DCA         7              713        0.110     [illegible]   0.143       0.272      -0.008       0.274
IDL         7            1,338        0.129       0.233       0.278       0.049       0.275       0.347
OFF         6              618        0.184       0.085       0.191       0.271       0.267       0.307
RND         6              682        0.108     [illegible]   0.174       0.093       0.096       0.193
WRI         6              517        0.198     [illegible]   0.262       0.234       0.254       0.248
Mean       6-7           5,164        0.137       0.153       0.194       0.189       0.176       0.247

†Row incomplete in source; the column assignment of the legible values is uncertain.

(c) Composite of (a) and (b) for all stations and forecasts

No. of                              Mean Bryan score
fcsts         Pers        Subj        CEP         Group       Lund        MDA

31,372        0.284       0.312       0.313       0.306       0.303       0.389


TABLE 3-4

PERCENT IMPROVEMENT (I) OF BRYAN SCORES FOR CATEGORICAL FORECASTS
RELATIVE TO PERSISTENCE FOR EVALUATION-YEAR DATA

     Predictand                                     I, %
Elem   Fcst length, hr     Subj         CEP         Group       Lund         MDA

CIG        2-3             -3.2         -0.7        -5.1        -4.8          9.9
CIG        4-5              5.6          1.8        -5.4         3.5         14.2
CIG        6-7             14.7          2.6         7.0         6.1         14.7
VIS        2-3          [illegible]      3.9         6.4      [illegible]    19.2
VIS        4-5          [illegible]      7.7         4.2      [illegible]    18.2
VIS        6-7          [illegible]      6.9         6.3      [illegible]    15.3


3.3  Special Evaluations

A number of special or experimental evaluations were either requested by the
Working  Group  or  were  required  because  of  peculiarities  encountered  in  the  course  of  the 
evaluation.  Among  these  were  the  SAGE  forecast  evaluation  requested  by  the  Air  Force, 
the  evaluation  of  special  subjective  probability  forecasts  made  by  the  U.  S.  Weather  Bureau, 
and  the  evaluation  of  terminal  forecasts  made  by  the  Lewis  technique. 

3.3.1  SAGE Forecasts

Subjective SAGE 2-hr ceiling and visibility forecasts at McGuire AFB were collected
and  decoded.  These  were  categorical  forecasts  and,  therefore,  were  compared  with 
categorical  forecasts  produced  by  all  other  techniques,  including  the  subjective  TAFOR 
made  for  the  same  station.  The  Bryan  scores  are  displayed  in  Table  3-5. 

3.3.2  Subjective Probability Forecasts

Special subjective probability forecasts were prepared at Washington National and
Idlewild Airports by U.S. Weather Bureau forecasters. The predictands for both stations
were 3-, 5-, and 7-hr visibility and ceiling. The instructions given to the forecasters were
that the forecasts were to be made not later than 1 hr after the input time. These
probability forecasts were verified and compared with other techniques by means of the P-score.
TABLE 3-5

BRYAN SCORES FOR THE EVALUATION OF 594 TWO-hr SAGE TERMINAL FORECASTS
FOR McGUIRE AFB

                                  Bryan score
Elem      SAGE      Pers      Subj      CEP       Group     Lund      MDA

CIG       0.403     0.451     0.4??     0.432     0.483     0.411     0.545
VIS       0.347     0.542     0.526     0.556     0.524     0.476     0.600


In  addition,  categorical  forecasts  were  made  from  these  probability  forecasts  in  such  a 
manner  as  to  maximize  the  Bryan  score.  These  were  verified  by  means  of  the  Bryan 
score. 

3.3.2.1  Brier-Allen P-score Verification

The Brier-Allen P-scores for the 12 predictands are given in Table 3-6 for the
subjective probability forecasts and for corresponding forecasts made by four statistical
techniques. The P-scores for subjective ceiling forecasts are lower (better) than those of
any of the statistical techniques except MDA. This is an interesting result when it is realized
that the subjective forecasters did not have much experience in preparing probability
forecasts. The paired-comparison t-test showed that, for all ceiling forecasts combined,
MDA P-scores are statistically significantly better than those of the subjective forecasts
beyond the 1% level. The P-scores for the subjective probability forecasts of visibility
were not as good as those of the statistical techniques.

3.3.2.2  Bryan-score Verification

The  subjective  probability  forecasts  were  converted  to  categorical  forecasts  by 
maximizing  the  Bryan  score.  The  Bryan  scores  for  the  12  predictands  for  five  techniques 
are  given  in  Table  3-7.  These  subjective  categorical  forecasts  yielded  higher  scores  than 
any  other  technique,  for  both  ceiling  and  visibility.  The  paired-comparison  t-test  indicated 
that  the  improvement  in  Bryan  scores  of  the  subjective  forecasts  over  those  of  the  second- 
ranking  technique  (MDA)  is  statistically  significant  beyond  the  2%  level. 

These  results  differ  from  the  results  obtained  with  the  P-score  verification.  This 
may  be  because  the  Bryan  score  puts  great  emphasis  upon  correct  forecasts  in  the  non-5 
categories, whereas the P-score treats the categories equally. Thus, probability
forecasts that are quite good at specifying probabilities in the non-5 categories but not so good
in category 5 will receive a poorer P-score but a better Bryan score.
TABLE 3-6

EVALUATION OF SUBJECTIVE PROBABILITY FORECASTS:
BRIER-ALLEN P-SCORES* FOR 12 PREDICTANDS FOR EVALUATION-YEAR DATA

(a) Predictand element is ceiling

     Predictand          No. of                    Brier-Allen P-score
Sta    Fcst length, hr   fcsts        Subj       CEP        Group      Lund       MDA

DCA         3              539        0.1371     0.1359     0.1382     0.2435     0.1219
DCA         5              540        0.1436     0.1375     0.1373     0.2273     0.1225
DCA         7              677        0.1688     0.1688     0.1627     0.2041     0.1412
IDL         3            1,189        0.1648     0.1796     0.1808     0.3408     0.1651
IDL         5            1,332        0.2282     0.241?     0.2530     0.2763     0.2111
IDL         7            1,333        0.2547     0.2740     0.2589     0.2781     0.2507
Mean       3-7           5,610        0.1829     0.1895     0.1852     0.2617     0.1688

(b) Predictand element is visibility

     Predictand          No. of                    Brier-Allen P-score
Sta    Fcst length, hr   fcsts        Subj       CEP        Group      Lund       MDA

DCA         3              605        0.1347     0.1250     0.1294     0.1309     0.1220
DCA         5              694        0.1330     0.1124     0.1206     0.1284     0.1109
DCA         7              708        0.1237     0.0951     0.1029     0.1029     0.0914
IDL         3            1,230        0.1487     0.1409     0.1492     0.1401     0.1297
IDL         5            1,208        0.1732     0.1513     0.1579     0.1876     0.1457
IDL         7            1,231        0.1825     0.1729     0.1816     0.1754     0.1650
Mean       3-7           5,676        0.1493     0.1329     0.1403     0.1442     0.1275

*Lower score indicates better forecast.


TABLE 3-7

EVALUATION OF EXPERIMENTAL SUBJECTIVE CATEGORICAL FORECASTS:
BRYAN SCORES* FOR 12 PREDICTANDS FOR EVALUATION-YEAR DATA

(a) Predictand element is ceiling

     Predictand          No. of                        Bryan score
Sta    Fcst length, hr   fcsts        Pers        Subj        Special     CEP         MDA
                                                              subj

DCA         3              537        0.3?3       0.559       0.599       0.272       0.393
DCA         5              538        0.252       0.264       0.551       0.270       0.338
DCA         7              676        0.165       0.252       0.3?0       0.205       0.308
IDL         3            1,182        0.428       0.438       0.477       0.431       0.459
IDL         5            1,326        0.387     [illegible]   0.503       0.385       0.460
IDL         7            1,327        0.262       0.353       0.464       0.273       0.360
Mean       3-7           5,586        0.305       0.540       0.427       0.306       0.386

(b) Predictand element is visibility

     Predictand          No. of                        Bryan score
Sta    Fcst length, hr   fcsts        Pers        Subj        Special     CEP         MDA
                                                              subj

DCA         3              603        0.188       0.218       0.586       0.500       0.50?
DCA         5              691        0.186       0.088       0.409       0.378       0.475
DCA         7              707        0.112       0.063       0.260       0.144       0.277
IDL         3            1,223        0.285       0.385       0.501       0.324       0.480
IDL         5            1,202        0.168       0.243       0.381       0.340       0.350
IDL         7            1,225        0.126       0.216       0.411       0.238       0.318
Mean       3-7           5,651        0.178       0.202       0.425       0.321       0.400

*Higher score indicates better forecast.
TABLE 3-8

COMPARISON OF TWO TYPES OF SUBJECTIVE CATEGORICAL FORECAST:
Composite contingency tables for 12 predictands

(a) Categorical forecasts made by subjective forecasters under
    operating conditions

                              Observed
Forecast        1        2        3        4        5      Total

   1           12        7        2        2        7         30
   2           51       78       41       20       69        259
   3           29       67      140      107      143        486
   4           11       31       89      203      317        651
   5           44       39       91      263    9,374      9,811
 Total        147      222      363      595    9,910     11,237

(b) Categorical forecasts generated from special subjective probability
    forecasts to maximize number of hits

                              Observed
Forecast        1        2        3        4        5      Total

   1           32       18       12        3       20         85
   2           44       79       42       24       71        260
   3           22       60      148      108      112        450
   4            6       24       82      218      299        629
   5           43       41       79      242    9,408      9,813
 Total        147      222      363      595    9,910     11,237

3.3.2.3  Comparison of Two Types of Subjective Forecast

It  is  clear  from  Table  3-7  that  the  Bryan  scores  for  subjective  forecasts  generated 
from  forecast  probabilities  are  better  than  those  of  the  subjective  categorical  forecasts 
prepared  routinely.  To  gain  further  information,  additional  categorical  forecasts  were 
produced  from  the  subjective  forecast  probabilities.  This  was  done  by  specifying  that 
the  categorical  forecast  is  that  given  by  the  category  with  the  highest  probability.  That 
is, if $f_1$, $f_2$, $f_3$, $f_4$, and $f_5$ are forecast probabilities, then the category with the highest f
is  the  categorical  forecast.  This  method  of  generating  categorical  forecasts  is  designed 
to  maximize  the  number  of  hits.  Table  3-8  compares  a  composite  contingency  table  over 
all  12  predictands  for  this  type  of  subjective  categorical  forecast  with  a  similar  table  for 
routinely prepared subjective categorical forecasts. The number of hits is 9807 for the
routine subjective forecasts, and 9885 for the categorical forecasts generated from the
probabilities.
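The number of hits is the sum of the diagonal of a contingency table, that is, the cases in which the forecast category equals the observed category. The sketch below uses the counts of Table 3-8(a); several cells are hard to read in the source and were filled in here from the row and column totals, so the cell values should be treated as an assumption rather than as a verified transcription.

```python
# Rows: forecast category 1-5; columns: observed category 1-5.
# ASSUMPTION: cells illegible in the source were inferred from the marginal totals.
routine = [
    [12, 7, 2, 2, 7],
    [51, 78, 41, 20, 69],
    [29, 67, 140, 107, 143],
    [11, 31, 89, 203, 317],
    [44, 39, 91, 263, 9374],
]


def hits(table):
    """Number of forecasts whose category matched the observed category."""
    return sum(table[i][i] for i in range(len(table)))


forecast_totals = [sum(row) for row in routine]        # row totals
observed_totals = [sum(col) for col in zip(*routine)]  # column totals
print(hits(routine), sum(forecast_totals))
```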

3.3.3  Lewis-technique  Forecasts 

Forecasts on the evaluation-year data were made by the Electronic Computer Branch,
1210th  Weather  Squadron,  Air  Weather  Service,  and  delivered  to  The  Travelers  Research 
Center,  Inc.,  for  verification.  The  only  forecasts  included  were  those  made  once  a  day, 
at  0500  EST,  for  5-  and  7-hr  ceiling  and  2-  and  5-hr  visibility  at  Washington  National  Airport. 
The  Lewis  technique  produces  only  categorical  forecasts.  Therefore,  the  forecasts  were 
verified  by  means  of  the  Bryan  score.  The  results  are  presented  in  Table  3-9. 

The  Bryan  scores  for  ceiling  forecasts  produced  by  the  Lewis  technique  are  higher 
than  those  produced  by  persistence,  subjective,  CEP,  or  Lund,  about  the  same  as  those 
produced by MDA, and lower than those produced by grouping. The paired-comparison
t-test  indicates  that  none  of  the  scores  for  ceiling  forecasts  is  significantly  different  from 


TABLE 3-9

EVALUATION BY BRYAN SCORES OF LEWIS TERMINAL FORECASTS
FOR WASHINGTON NATIONAL AIRPORT

     Predictand           No. of                                  Bryan scores
Elem   Fcst length, hr    fcsts        Lewis     Pers      Subj      CEP       Group     Lund      MDA

CIG         5          [illegible]     0.416     0.262     0.294     0.325     0.415     0.321     0.373
CIG         7              3??         0.275     0.224     0.223     0.303     0.454     0.254     0.327
Mean       5-7             566         0.346     0.245     0.259     0.314     0.435     0.288     0.350

VIS         2          [illegible]     0.459     0.509       *       0.462     0.821     0.498   [illegible]
VIS         5              542         0.274     0.225     0.137     0.658     0.393     0.435     0.667
Mean       2-5             686         0.367     0.366       -       0.560     0.607     0.467     0.?97

*No forecasts available.


any other score at the 5% level. This is an indication that the sample is too small to
allow any reasonable judgment. Lewis visibility scores were lower than those produced
by  the  other  techniques,  with  the  exception  of  persistence  and  subjective  forecasts.  A 
significant  difference  at  the  1%  level  was  found  only  between  the  Lewis  scores  and  the 
MDA  scores. 
APPENDIX A.  DESCRIPTION OF THE BRYAN SCORE*

Both ceiling and visibility have been subdivided into five operationally significant
classes. Each predictand, therefore, can yield just 25 possible combinations of observed
and forecast classes. The problem of scoring the categorical forecasts is one of deciding
upon a quantitative merit or demerit for each combination. Consensus has proved so
difficult to achieve, however, that it has been agreed to compromise on a scoring system
developed mathematically from reasonable but arbitrary assumptions. The assumptions
themselves have been arrived at through a gradual process of exploration and amendment.

With probability of occurrence heavily concentrated in one class, it was found
that the prescribed conditions in their original form could not be satisfied simultaneously,
but that an amended set of conditions could be satisfied simultaneously. The original
set of conditions will be described first, and, afterward, the amended set.

1.  The first assumption was that demerits should be progressive. If the error
of forecasting class 2 when in fact class 1 is observed receives the demerit $-d_1$, and
the error of forecasting class 3 when class 2 is observed receives the demerit $-d_2$, then
the error of forecasting class 3 when class 1 is observed (being regarded as the sum of
the two errors) receives the demerit $-(d_1 + d_2)$. It is not necessary under this condition
(but it is, as a consequence of other conditions) to assume symmetry. That is,
conceivably the error of forecasting class 1 when class 2 actually occurs need not receive
the same demerit as the error of forecasting class 2 when class 1 actually occurs.
The general pattern of merits ($x_1$, $x_2$, ...) and demerits ($-d_1$, $-\delta_1$, $-d_2$, $-\delta_2$, ...) is
displayed in Table A-1. If the x's are to have the effect of merits and the $-d$'s and $-\delta$'s are
to have the effect of demerits, no x, d, or $\delta$ can be negative.

2.  The second assumption was that if forecasts produced by a given procedure
are  distributed  independently  of  the  observed  weather,  the  forecasts  should  receive,  in 
the  long  run,  an  average  score  of  zero.  This  assumption  was  based  on  the  fact  that  such 
TABLE A-1

ASYMMETRICAL PATTERN OF MERITS AND DEMERITS

Observed                                      Forecast class
class       1                    2                3              4                5

  1         x_1                  -d_1             -(d_1+d_2)     -(d_1+d_2+d_3)   -(d_1+d_2+d_3+d_4)
  2         -δ_1                 x_2              -d_2           -(d_2+d_3)       -(d_2+d_3+d_4)
  3         -(δ_1+δ_2)           -δ_2             x_3            -d_3             -(d_3+d_4)
  4         -(δ_1+δ_2+δ_3)       -(δ_2+δ_3)       -δ_3           x_4              -d_4
  5         -(δ_1+δ_2+δ_3+δ_4)   -(δ_2+δ_3+δ_4)   -(δ_3+δ_4)     -δ_4             x_5


forecasts  are  technically  devoid  of  information,  in  the  sense  of  reducing  uncertainty. 

3.  The  third  assumption  was  a  corollary  of  the  second.  A  purely  random,  but 
climatologically  realistic  forecast,  being  statistically  independent  of  the  actual  weather, 
cannot  yield  uncertainty- reducing  information  on  any  observed  category.  Hence  the 
third  assumption  was  that  if  a  population  of  such  forecasts  were  subclassified  according 
to  the  respective  categories  of  the  observed  weather,  the  average  score  of  those  random 
forecasts  should  be  zero  in  each  category  of  the  observed. 

4.  The  fourth  assumption  was  based  on  consideration  of  a  statistical  test  by  which 
two  forecasting  methods  could  be  compared.  With  the  purpose  of  making  the  statistical 
test  of  comparative  merit  as  sensitive  as  possible,  an  optimization  criterion  was  imposed. 
In general, the most sensitive test would have to utilize knowledge of the joint
probabilities with which any two methods yield any possible combination of forecasts for a given
observed category. Such knowledge is unavailable, but in its place a particular pair of
forecast types, representing the extremes of merit, were chosen as the basis of
optimization. These were the perfect forecast and the random climatological forecast. The fourth
assumption, then, was that the merits and demerits, subject to other conditions, should be
determined so as to maximize the value of the statistic defined by the t-test for
comparative merit of the perfect and random climatological forecasts. Obviously, if these two
forecasts  could  be  identified  respectively  as  perfect  and  random,  in  advance,  no  test  would 
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TABLE  A-2 

SYMMETRICAL  PATTERN  OF  MERITS  AND  DEMERITS 


Class 

Forecast 

1 

2 

3 

4 

5 

Observed 

1 

X1  j 

-(dj  +  dg) 

„:ldJ  +  *g  +  a3> 

-(d1  +  *2  +  +J*4)  1 

2 

-d1 

-dg 

-(dg  +  dy  +  djj) 

3 

a<0|  +  dg) 

...  . -a3 . 

-(d,  +  d4) 

4 

+  ^3) 

-(«2+'a3) 

:  _ “’J . 

x4 

_  _ ~d4 

5  J  -(d,  +  dg  +  dj  +  d*) 

-(dg  +  dj  +  djj) 

^(d3  +  d^)  • 

~d4  . 

:  x5 

be  necessary;  but  they  are  used  only  as  a  basis  of  optimizing  scores  on  which  other  kinds 
of  forecast  may  be  judged. 
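
The statistic named in assumption 4 is not written out at this point in the text. As a
rough illustration, the sketch below assumes the ordinary paired t statistic, computed
from the case-by-case differences between the scores of two forecasting methods on the
same observations. The function name and the sample scores are hypothetical.

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """Paired t statistic for the mean difference between two sets of scores.

    scores_a, scores_b : scores earned by two methods on the same cases
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples of at least two cases")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)   # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


# Hypothetical scores on three cases for two methods.
t = paired_t([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
print(round(t, 4))
```

Under this reading, choosing merits and demerits that enlarge the mean difference between
perfect and random forecasts relative to its scatter makes the statistic larger for a
given sample size.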

5.  The  fifth  assumption  was  that  the  perfect  forecast  should  receive,  in  the  long 
run,  an  average  score  of  unity.  Any  other  arbitrary  positive  constant,  such  as  100%, 
would  satisfy  mathematical  necessities  just  as  well;  but  some  arbitrary  constant  has  to 
be  chosen,  to  establish  the  scale  of  measurement.  Unity  was  chosen. 


The combined effect of assumptions 2 and 3 was to produce a symmetrical pattern
of demerits; that is, the mathematical consequence was to make δ1 = d1, δ2 = d2, δ3 = d3,
and δ4 = d4. The resulting scheme is shown in Table A-2.

The foregoing five assumptions constitute a self-consistent system of conditions,
provided that the probabilities of the separate weather classes are not too disparate.
With nearly every predictand in our data, however, they are too disparate. Excepting
only visibility at Atlantic City, the logical requirement that the d's be nonnegative, in
order that the -d's have the effect of demerits, forced all but d4 to take the value zero.
Thus the scoring scheme for every predictand but visibility at Atlantic City reduced to
the form illustrated in Table A-3. As to the exception, there are two nonvanishing terms,
d4 and one other d; but the magnitude of the latter is too small to have an appreciable
effect on the scores. The only distinction is that with the other predictands this d is
precisely zero, whereas here it is nearly zero.

TABLE A-3

EVANESCED PATTERN OF MERITS AND DEMERITS
FORCED BY VANISHING OF d1, d2, AND d3

Observed             Forecast class
 class     1      2      3      4      5
   1       x1     0      0      0      -d4
   2       0      x2     0      0      -d4
   3       0      0      x3     0      -d4
   4       0      0      0      x4     -d4
   5       -d4    -d4    -d4    -d4    x5

With d1, d2, and d3 reducing to zero, the merits x1, x2, ..., x5 are determined under
assumption 2 as fixed multiples of d4, thus precluding any attention to assumption 4. This
exclusion of assumption 4 can be avoided by revising assumptions 1, 2, and 3. When this
is done, all assumptions of the amended system are compatible.

Before stating the revisions of assumptions 1, 2, and 3, the exploration of another
avenue, via assumption 1, will be described. The pattern of Table A-3 suggests the
possibility that the build-up of demerits under assumption 1 might have something to do
with narrowing the only permissible solution to d1 = 0, d2 = 0, and d3 = 0. Another
scoring pattern that would allow some distinctions to be made, without differentiating
among as many degrees of error as Table A-2, is displayed in Table A-4. Unfortunately,
when the other assumptions were kept the same, the outcome was identical: d1 = 0, d2 =
0, and d3 = 0.

TABLE A-4

ESTED PATTERN OF MERITS AND DEMERITS

[Table entries illegible in the source.]

By this time, it seemed appropriate to start with the pattern of Table A-3 as the
revised assumption 1, and seek a modification of assumptions 2 and 3 that would preserve
their original intent as far as possible and yet leave room for statistical optimization. A
clue may be gleaned from the probability distributions, shown in Table A-5, where p1, p2,
..., p5 are the respective empirical probabilities of the five weather classes at each
station. In view of the great predominance of class 5, it was concluded that the intent of
assumptions 2 and 3 could be served, to a practical approximation, by confining the
application of original assumption 2 to class 5 and relaxing assumption 3 to the extent of
giving a random climatological forecast an overall average score of zero rather than by
requiring the average score to be zero separately in each observed class. Taken together,
these assumptions make the average score of a random climatological forecast zero in
observed class 5 and also zero in the other four classes combined. There are six constants
and three equations of constraint; hence there are three degrees of freedom left for
optimization. An informationless forecast produced by the best theory-of-games pure
strategy can earn a nonzero average score, but the maximum average is small. In this
connection, it is well to remember that the t-test eliminates the effect of a common base
in paired comparisons, and that probability forecasts actually take advantage of the
game-theory principle but apply it to presumably sharper probabilities.

TABLE A-5

EMPIRICAL PROBABILITIES

   p1       p2       p3       p4       p5
  0.028    0.040    0.051    0.083    0.792
  0.031    0.019    0.054    0.037    0.879
  0.017    0.029    0.083    0.186    0.685
  0.020    0.021    0.069    0.080    0.810
  0.004    0.015    0.048    0.076    0.857
  0.005    0.005    0.016    0.032    0.942
  0.010    0.029    0.052    0.087    0.822
  0.010    0.015    0.021    0.033    0.923
  0.008    0.047    0.034    0.134    0.777
  0.007    0.009    0.027    0.033    0.924
  0.016    0.022    0.132    0.147    0.683
  0.010    0.008    0.025    0.025    0.932
  0.020    0.058    0.084    0.139    0.719
  0.022    0.024    0.089    0.096    0.769

The  revised  assumptions  are  as  follows. 


1. The scheme of merits and demerits is as displayed in Table A-6.

2. Any procedure that produces forecasts of class 5 independently of the
occurrence of class 5 receives a population average score of zero on observed class 5.

3. Taken over all classes, a random climatological forecast receives a population
average score of zero.

4. Subject to the stated constraints, the merits and demerits are determined so as
to maximize the statistic defined by the t-test of the difference between the average scores
of perfect forecasts and random climatological forecasts.

5. The perfect forecast receives a population average score of unity. (However,
the sample average value need not equal unity.)

TABLE A-6

FINALIZED PATTERN OF MERITS AND DEMERITS

Observed             Forecast class
 class     1      2      3      4      5
   1       x1     0      0      0      -Y
   2       0      x2     0      0      -Y
   3       0      0      x3     0      -Y
   4       0      0      0      x4     -Y
   5       -Y     -Y     -Y     -Y     x5

TABLE A-7

NUMERICAL VALUES OF MERITS AND DEMERITS

Sta   Elem      x1       x2       x3       x4       x5       Y
ACY   CIG     4.296    4.550    4.432    4.569    0.090    0.542
ACY   VIS     7.991    7.887    8.020    8.050    0.040    0.290
CEF   CIG     2.357    2.381    2.500    2.766    0.246    0.535
CEF   VIS     4.642    4.648    4.896    4.958    0.090    0.385
DCA   CIG     6.188    6.262    6.478    6.671    0.074    0.446
DCA   VIS    16.335   16.588   16.565   16.840    0.025    0.404
IDL   CIG     4.922    5.021    5.146    5.346    0.087    0.404
IDL   VIS    12.462   12.497   12.601   12.769    0.027    0.323
OFF   CIG     3.635    3.778    3.728    4.141    0.145    0.506
OFF   VIS    12.489   12.520   12.746   12.837    0.029    0.355
RND   CIG     2.336    2.399    2.667    2.711    0.250    0.495
RND   VIS    14.247   14.203   14.462   14.468    0.024    0.325
WRI   CIG     2.857    2.909    3.054    3.245    0.174    0.445
WRI   VIS     3.687    3.697    3.966    3.997    0.122    0.407

The merits and demerits for each predictand are exhibited in Table A-7. The maximum
average scores obtainable from an informationless pure strategy are shown in Table A-8.
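
The three equations of constraint (revised assumptions 2, 3, and 5) can be checked against
the tabulated constants. The sketch below is an illustration rather than the authors'
computation. It uses the Westover (CEF) ceiling row of Table A-7 and assumes that the third
row of Table A-5 belongs to the same predictand. It also reads revised assumption 2 as
applying to a forecast that issues class 5 with its climatological frequency p5; that
reading, like the function name, is an assumption.

```python
def constraint_residuals(x, Y, p):
    """Residuals of the three constraints for the pattern of Table A-6.

    x : merits x1..x5, Y : the single demerit, p : probabilities p1..p5.
    Each residual should be near zero if the constants satisfy the constraint.
    """
    p5, x5 = p[4], x[4]
    # Revised assumption 2: random forecasts average zero on observed class 5.
    class5 = p5 * x5 - (1.0 - p5) * Y
    # Revised assumption 3: random climatological forecasts average zero overall.
    # Observed class i < 5 earns x_i with probability p_i and -Y with probability p5.
    overall = sum(p[i] * (p[i] * x[i] - p5 * Y) for i in range(4)) + p5 * class5
    # Assumption 5: perfect forecasts average unity.
    perfect = sum(pi * xi for pi, xi in zip(p, x)) - 1.0
    return class5, overall, perfect


# Westover ceiling: merits and demerit from Table A-7; probabilities assumed
# to be the third row of Table A-5.
x = [2.357, 2.381, 2.500, 2.766, 0.246]
Y = 0.535
p = [0.017, 0.029, 0.083, 0.186, 0.685]
residuals = constraint_residuals(x, Y, p)
print([round(r, 4) for r in residuals])
```

With three-decimal table entries the residuals cannot be expected to vanish exactly; for
these inputs each one comes out smaller than 0.001 in magnitude.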


TABLE A-8

MAXIMUM AVERAGE SCORES OF
INFORMATIONLESS PURE STRATEGY

Sta   Elem   Max av score
ACY   CIG       0.108
ACY   VIS       0.043
CEF   CIG       0.148
CEF   VIS       0.085
DCA   CIG       0.115
DCA   VIS       0.158
IDL   CIG       0.133
IDL   VIS       0.123
OFF   CIG       0.162
OFF   VIS       0.096
RND   CIG       0.062
RND   VIS       0.061
WRI   CIG       0.131
WRI   VIS       0.071
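
An informationless pure strategy is taken here to mean always forecasting the same class.
Under the pattern of Table A-6, its long-run average follows directly from the merits, the
demerit Y, and the class probabilities. The sketch below is an inference from that pattern,
not a formula quoted from the report; the function name is hypothetical, and the
probabilities are again assumed to be the third row of Table A-5 (Westover ceiling).

```python
def pure_strategy_average(k, x, Y, p):
    """Long-run average score from always forecasting class k (1 to 5)."""
    p5 = p[4]
    if k == 5:
        # A hit whenever class 5 occurs, a demerit of Y otherwise.
        return p5 * x[4] - (1.0 - p5) * Y
    # A hit whenever class k occurs, a demerit of Y whenever class 5 occurs,
    # and zero when one of the other classes 1 through 4 occurs.
    return p[k - 1] * x[k - 1] - p5 * Y


# Westover ceiling constants from Table A-7; probabilities assumed as stated above.
x = [2.357, 2.381, 2.500, 2.766, 0.246]
Y = 0.535
p = [0.017, 0.029, 0.083, 0.186, 0.685]
averages = {k: pure_strategy_average(k, x, Y, p) for k in range(1, 6)}
best = max(averages, key=averages.get)
print(best, round(averages[best], 3))
```

For these inputs the best constant forecast is class 4, with an average of about 0.148,
which agrees with the Westover ceiling entry of Table A-8.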

A sample layout of the scoring scheme, indicating how the merits and demerits
(Table A-7) should be interpreted, is shown in Table A-9. This illustration exhibits the
scoring scheme for Westover ceiling.

TABLE A-9

EXAMPLE OF SCORING SCHEME
FOR CEILING AT WESTOVER AFB

Observed                     Forecast class
 class       1         2         3         4         5
   1       2.357     0         0         0        -0.535
   2       0         2.381     0         0        -0.535
   3       0         0         2.500     0        -0.535
   4       0         0         0         2.766    -0.535
   5      -0.535    -0.535    -0.535    -0.535     0.246

A feature of the merits that might be puzzling is that, in classes 1 through 4 of a
given predictand, the merits are slightly greater for the more probable classes. The
merits and demerits are the consequences of several mathematical conditions. The main
effect of these conditions, on the kind of probability distribution we are dealing with, is
to make the merits nearly uniform in classes 1 through 4; but a minor effect is to make
them increase slightly with increasing probability of the classes. The scores have been
designed for comparing skill, and, among the first four categories, those that occur more
often afford greater opportunity for making comparisons. As it turns out, the merits in
classes 1 through 4 are roughly equal to the reciprocal of the probability of the
nonoccurrence of class 5. Broadly speaking, the general operation of the scoring scheme is
to measure skill in distinguishing between the occurrence and nonoccurrence of class 5
and at the same time to compare hits in classes 1 through 4.
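
To show how such a scheme is applied, the sketch below averages the scores of a small set
of forecasts against the Westover ceiling matrix, with the constants taken from the CEF
ceiling row of Table A-7. The function name and the six observed/forecast pairs are
hypothetical and are not data from the study.

```python
# Rows are observed classes 1 to 5, columns are forecast classes 1 to 5.
WESTOVER_CEILING = [
    [2.357, 0.0, 0.0, 0.0, -0.535],
    [0.0, 2.381, 0.0, 0.0, -0.535],
    [0.0, 0.0, 2.500, 0.0, -0.535],
    [0.0, 0.0, 0.0, 2.766, -0.535],
    [-0.535, -0.535, -0.535, -0.535, 0.246],
]


def average_score(pairs, S=WESTOVER_CEILING):
    """Average score over (observed_class, forecast_class) pairs, classes 1 to 5."""
    if not pairs:
        raise ValueError("at least one forecast is required")
    return sum(S[obs - 1][fcst - 1] for obs, fcst in pairs) / len(pairs)


# Hypothetical sample: two hits on class 5, one hit on class 4, one miss that
# forecast class 5 when class 4 occurred, one unpenalized miss within classes
# 1 through 4, and one miss that forecast class 4 when class 5 occurred.
sample = [(5, 5), (5, 5), (4, 4), (4, 5), (3, 4), (5, 4)]
print(round(average_score(sample), 3))
```

The sample illustrates the two roles described above: errors involving class 5 cost Y in
either direction, while confusion among classes 1 through 4 simply earns nothing.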